Unsupervised determination of lung tumor margin with widefield polarimetric second-harmonic generation microscopy

The extracellular matrix (ECM) is among the many tissue components affected by cancer; however, morphological changes of the ECM are not well understood and are thus often omitted from diagnostic considerations. Polarimetric second-harmonic generation (P-SHG) microscopy allows visualization and characterization of the collagen ultrastructure in the ECM, aiding the understanding of the changes induced by cancer throughout the tissue. In this paper, a large region of a hematoxylin and eosin (H&E) stained human lung section, encompassing a tumor margin that connects a significant tumor portion to normal tissue, was imaged with P-SHG microscopy. The resulting polarimetric parameters were used in principal component analysis and unsupervised K-Means clustering to separate normal-like and tumor-like tissue. Consequently, a pseudo-color map of the clustered tissue regions was generated to highlight the irregularity of the ECM collagen structure throughout the region of interest and to identify the tumor margin in the absence of morphological characteristics of the cells.

A total of 9 additional lung tissue regions across 2 additional sample slides, each belonging to a different patient (3 patient samples overall, including the large extended region), were measured to validate the applicability of PC2 in tumor margin detection. Following the P-SHG analysis of the images, the computed polarimetric and texture parameters were used to form a linear combination corresponding to the PC2 found from the large extended region presented in the manuscript. This procedure was performed by extracting the parameter coefficients of PC2 from the large extended region and subsequently applying these coefficients to the parameters found for the additionally imaged areas. Having recreated PC2 for all imaged areas, K-Means was used to generate the binary, silhouette, continuous, and median-filtered maps, as described in the manuscript. As such, the results may be used to directly validate the reproducibility of the PC2 findings in identifying the tumor margin.
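The coefficient-transfer step can be illustrated with a minimal sketch. It assumes that each sub-image is described by a row of parameters, that the parameters are centered and scaled with statistics taken from the reference region, and that the binary clustering acts on the resulting scalar PC2 scores. The function names and all numerical values are hypothetical and serve illustration only; they are not taken from the manuscript.

```python
def pc2_scores(rows, loadings, center, scale):
    """Project each parameter row onto a fixed PC2 axis.

    `loadings`, `center`, and `scale` are assumed to come from the
    reference (large extended) region; they are not refitted here.
    """
    scores = []
    for row in rows:
        z = [(x - c) / s for x, c, s in zip(row, center, scale)]
        scores.append(sum(w * v for w, v in zip(loadings, z)))
    return scores


def two_means_1d(values, max_iter=100):
    """Binary K-Means on scalar scores.

    Returns 0 for points nearer the lower centroid and 1 otherwise.
    Centroids start at the minimum and maximum of the data.
    """
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(max_iter):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not high:  # all scores identical: only one cluster exists
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels


# Hypothetical two-parameter example for four sub-images of a new region.
rows = [[1.5, 1.0], [1.4, 1.2], [0.5, 3.0], [0.6, 2.9]]
scores = pc2_scores(rows, loadings=[0.6, -0.8], center=[1.0, 2.0], scale=[0.5, 1.0])
labels = two_means_1d(scores)
```

Because the loadings are held fixed, the scores of the additional regions lie on the same axis as those of the reference region, which is what allows the clustered maps to be compared directly.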
Supplementary Fig. 1 shows the results of the analysis in a format similar to that of Fig. 3 of the manuscript. Seemingly normal tissue regions are highlighted in yellow, while the areas in close proximity to the tumor are depicted in magenta, corresponding to the trends observed in the large extended region presented in Fig. 3. Nonetheless, it is important to compare the clustering results with ground-truth labels.

Supplementary Note: Optimal number of sub-images for K-Means
An extended large region of non-small cell lung carcinoma tissue was imaged with the widefield P-SHG microscope [4], and the polarimetric parameters were extracted. In order to generate high-resolution maps of texture parameters, including contrast, correlation, entropy, angular second moment, and inverse difference moment [5], the underlying region of interest was first subdivided into sub-images. The polarimetric and texture parameters of the subdivided area were then analyzed with PCA and K-Means clustering to highlight morphological changes in the collagenous extracellular matrix and, ultimately, to identify the tumor margin. It is therefore important to assess the effect of the subdivision level on the performance of the clustering algorithm.
To identify the optimal subdivision level for the analysis, the polarimetric parameters in each of the 12 images used to tile the extended region were subdivided into 1, 4, 16, 64, 256, 1024, 4096, and 16384 sub-images. Following the subdivision, binary K-Means clustering was applied, and the silhouette scores of each individual cluster, along with the arithmetic and harmonic means of these scores, were computed [6]. Supplementary Fig. 3a illustrates the resulting silhouette scores at all considered subdivisions.
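The subdivision levels listed above are successive powers of 4. A minimal sketch of such a tiling is given here, under the assumption (not stated in the text) that level k corresponds to a square grid of 2^k by 2^k equal tiles. The function names are hypothetical, and the per-tile mean stands in for whichever polarimetric or texture parameter is evaluated on each sub-image.

```python
def subdivide(image, level):
    """Split a 2-D list into 4**level equal tiles (2**level per side).

    Tiles are returned in row-major order. The square-grid layout is an
    assumption; the source only states the number of sub-images.
    """
    n = 2 ** level
    rows, cols = len(image), len(image[0])
    if rows % n or cols % n:
        raise ValueError("image size must be divisible by the grid size")
    th, tw = rows // n, cols // n
    tiles = []
    for r in range(n):
        for c in range(n):
            band = image[r * th:(r + 1) * th]
            tiles.append([row[c * tw:(c + 1) * tw] for row in band])
    return tiles


def tile_means(image, level):
    """Mean pixel value of every tile at the given subdivision level."""
    return [
        sum(map(sum, tile)) / (len(tile) * len(tile[0]))
        for tile in subdivide(image, level)
    ]


# Sub-image counts for levels 0 to 7, matching the list in the text.
counts = [4 ** k for k in range(8)]

# Toy 4 x 4 image split into 4 sub-images (level 1).
toy = [[4 * r + c for c in range(4)] for r in range(4)]
means = tile_means(toy, 1)
```

Each additional level quadruples the number of samples passed to the clustering step while reducing the number of pixels averaged within each sample, which is the trade-off examined in this note.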
It is evident that clustering with 1 or 4 sub-images per image resulted in larger silhouette scores across the board. However, the percentage of data assigned to each cluster (Supplementary Fig. 3b) shows that mostly a single cluster (cluster 2) is formed. Moreover, at such low subdivision levels, the details underpinning more than 50 million pixels in the extended imaged region are disregarded, and mapping the features across the area is ineffective. Thus, we remove 1 and 4 sub-images per image from the list of potential subdivision levels.
At subdivisions of 16 and larger, the percent cluster sizes are more comparable and the silhouette scores are more stable. As the subdivision level increases further, however, the disparity between cluster sizes grows, and the algorithm once again effectively forms a single cluster. This effect is not reflected in the commonly used arithmetic mean of the silhouette scores, which tends to plateau around a silhouette score of 0.4 (Supplementary Fig. 3a). We notice that as the subdivision level increases, the performance of cluster 2 suffers, as indicated by the decreasing magenta curve, while the opposite is true for cluster 1. To fully capture this effect, the harmonic mean of the silhouette scores of clusters 1 and 2 was computed, similar to the F1-score used to combine precision and recall in supervised machine learning. As depicted by the black curve in Supplementary Fig. 3a, the harmonic mean indeed penalizes the decrease in performance at larger subdivisions due to cluster size disparity, resulting in two optimal subdivision levels at 64 and 256 sub-images per image.
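The difference between the two averages can be illustrated with a short sketch that computes the mean silhouette value of each cluster and then combines the two values arithmetically and harmonically. It assumes Euclidean distances and the standard silhouette definition; the function names and example values are hypothetical, and the harmonic mean is set to zero when either score is non-positive, which is a convention chosen here rather than one taken from the source.

```python
from math import dist
from statistics import fmean


def silhouette_per_cluster(points, labels):
    """Mean silhouette value of each cluster (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    per_cluster = {c: [] for c in clusters}
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            d = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                 if l == c and j != i]
            if d:
                mean_dist[c] = fmean(d)
        if lab not in mean_dist:  # singleton cluster: silhouette is 0
            per_cluster[lab].append(0.0)
            continue
        a = mean_dist[lab]  # cohesion within the point's own cluster
        b = min(v for c, v in mean_dist.items() if c != lab)  # nearest other
        denom = max(a, b)
        per_cluster[lab].append((b - a) / denom if denom > 0 else 0.0)
    return {c: fmean(v) for c, v in per_cluster.items()}


def combine_scores(s1, s2):
    """Arithmetic and harmonic means of two per-cluster silhouette scores."""
    arithmetic = (s1 + s2) / 2
    harmonic = 2 * s1 * s2 / (s1 + s2) if s1 > 0 and s2 > 0 else 0.0
    return arithmetic, harmonic


# Two well-separated toy clusters, then a deliberately imbalanced pair of scores.
per_cluster = silhouette_per_cluster([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
balanced = combine_scores(0.5, 0.5)    # both means equal 0.5
imbalanced = combine_scores(0.8, 0.2)  # arithmetic mean stays at 0.5
```

With the hypothetical scores 0.8 and 0.2, the arithmetic mean is unchanged relative to the balanced case, whereas the harmonic mean drops to about 0.32, mirroring the way the harmonic mean exposes a poorly performing cluster that the arithmetic mean conceals.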
Since the aim of this investigation is to provide a detailed, high-resolution map of morphological changes of the collagenous extracellular matrix across the tissue, the larger of the two subdivision levels is preferable.
As such, 256 sub-images per image were used to subdivide the data for further investigations.